Abstract

We consider the estimation of a regression function with random design and heteroscedastic noise in
a non-parametric setting. More precisely, we address the problem of characterizing the optimal penalty
when the regression function is estimated by using a penalized least-squares model selection method. In this
context, we show the existence of a minimal penalty, defined to be the maximum level of penalization under
which the model selection procedure totally misbehaves. Moreover, the optimal penalty is shown to be twice
the minimal one and to satisfy a nonasymptotic pathwise oracle inequality with leading constant almost
one. When the shape of the optimal penalty is known, this allows one to apply the so-called slope heuristics
initially proposed by Birgé and Massart [14], which further provides a data-driven calibration of the
penalty procedure. Finally, the use of the results obtained by the author in [30], considering the least-squares
estimation of a regression function on a fixed finite-dimensional linear model, allows us to go beyond
the case of histogram models, which is already treated by Arlot and Massart in [6].

Keywords: Optimal model selection, Slope heuristics, Heteroscedastic regression, Data-driven penalty.

1 Introduction

Model selection by penalization has been the object of intensive research in the last decades. Given a collection
of models and associated estimators, two different tasks can be tackled: find out the smallest true model
(consistency problem), or select an estimator achieving the best performance according to some criterion, called
a risk (efficiency problem). We only focus on the efficiency problem, where the leading idea of penalization,
which goes back to the early works of Akaike [1], [2] and Mallows [27], is to perform an unbiased estimation of
the risk of the estimators. The FPE and AIC procedures proposed by Akaike in [1] and [2] respectively, as well
as Mallows' $C_p$ or $C_L$ [27], aim to do so by adding to the empirical risk a penalty which depends on the
dimension of the models. But the first analyses of such procedures had the drawback of being fundamentally
asymptotic, considering in particular that the number of models as well as their dimensions are fixed while
the number of data tends to infinity. As explained for example in Massart [28], various statistical situations
require letting these quantities depend on the number of data. Pointing out the importance of Talagrand's type
concentration inequalities in this nonasymptotic approach, Birgé and Massart [13], [15] and Barron, Birgé and
Massart [5] have thus been able to build nonasymptotic oracle inequalities for penalization procedures that take
into account the complexity of the collection of models. In an abstract risk minimization framework, which 
includes statistical learning problems such as classification or regression, many distribution-dependent and 
data-dependent penalties have been proposed, from the more general and thus less accurate global penalties, 
see Koltchinskii [22], Bartlett et al. [9], to the refined local Rademacher complexities in the case where some
margin relations hold (see for instance Bartlett, Bousquet and Mendelson [10], Koltchinskii [23]). But as a price
to pay for generality, the above penalties suffer from their dependence on unknown or unrealistic constants.
They are very difficult to implement and calibrate in practice and satisfy oracle inequalities with possibly
huge leading constants. For general purposes, there are other penalties such as the bootstrap penalties
of Efron [19] and the resampling and V-fold penalties of Arlot [4] and [3]. These penalties are essentially
resampling estimates of the difference between the empirical risk and the risk and can be used in practice
since, in particular, they avoid the practical drawbacks of the local Rademacher complexities. Arlot [3], [4]
also proves sharp pathwise oracle inequalities for the resampling and V-fold penalties in the case of regression
with random design and heteroscedastic noise on histogram models, and conjectures that the restriction to
histograms is mainly technical and that his results can be extended to more general situations.

We address in this article the problem of optimal model selection in a bounded heteroscedastic regression setting
with random design. A penalty will be said to be optimal if it achieves a nonasymptotic oracle inequality
with leading constant almost one, i.e. converging to one when the number of data tends to infinity. In the 
following we restrict ourselves to "small" collections of models, where the number of models is not more than 
polynomial in the number of data, a case where such an optimal penalty can exist. In more general settings, 
where the collection of models is large, one should gather the models of equal or equivalent complexity and 
derive an oracle inequality with respect to the infimum of the risk on the union of models with the same 
complexities, as explained in Birgé and Massart [14]. This would allow one to consider optimal penalties for large
collections of models, but this problem is beyond the scope of this article. Birgé and Massart [14] have
discovered in a generalized linear Gaussian model setting, that the optimal penalty is closely related to the 
minimal one, defined to be the maximal penalty under which the procedure totally misbehaves. They prove 
sharp upper and lower bounds for the minimal penalty and show that the optimal penalty is two times the 
minimal one, both for small and large collections of models. These facts are called by the authors the slope 
heuristics. The authors also exhibit a jump in the dimension of the selected model occurring around the value 
of the minimal penalty, and use it to estimate the minimal penalty from the data. Taking a penalty equal to two 
times the previous estimate then gives a nonasymptotic quasi-optimal data-driven model selection procedure. 
The algorithm proposed by Birgé and Massart [14] to estimate the minimal penalty relies on the previous
knowledge of the shape of the latter, which is a known function of the dimension of the models in their setting. 
Thus their procedure gives a data-driven calibration of the minimal penalty. Considering the case of Gaussian 
least-squares regression with unknown variance, Baraud, Giraud and Huet [7] have also derived lower bounds 
for the penalty terms for small and large collections of models, as well as Castellan [18] in the case of maximum
likelihood estimation of density on histograms where a lower bound on the penalty term is given only for small 
collections of models. Then the slope phenomenon has been extended by Arlot and Massart [6] to a bounded
heteroscedastic regression framework with random design. They consider least-squares estimators on a "small"
collection of histogram models. Heteroscedasticity of the noise allows them to validate the slope heuristics
without assuming a particular shape of the penalty, and in particular to consider situations where the shape 
of the penalty is not a function of the dimension of the models. In such general cases, the authors propose 
to estimate the shape of the penalty by using Arlot's resampling or V-fold penalties, proved to be efficient in
their regression framework by Arlot [3] and [4], in order to derive an accurate data-driven calibration of the
optimal penalty. Moreover, their approach is more general than the histogram case, except for some identified
technical parts of their proofs, thus providing a quite general algebra that can be applied in other
frameworks to derive sharp model selection results. The authors have also identified the minimal penalty as 
the mean of the empirical excess risk on each model, and the ideal penalty to be estimated as the sum of 
the empirical excess risk and true excess risk on each model. The slope heuristics then heavily relies on the 
fact that the empirical excess risk is equivalent to the true excess risk for models of reasonable dimensions. 
Arlot and Massart [6] conjecture that this equivalence between the empirical and true excess risk is a quite 
general fact in M-estimation, as well as, by rather direct consequence, the slope phenomenon for models not 
too badly chosen in terms of approximation properties. A general result supporting this conjecture is the 
high dimensional Wilks' phenomenon discovered by Boucheron and Massart [16] in the setting of bounded
contrast minimization under margin conditions, where the authors derive concentration inequalities for the
true and empirical excess risks when the considered model satisfies some general condition on the first
moment of the supremum of the empirical process on localized slices of variance in the loss class. This
assumption can be made explicit under suitable covering entropy conditions on the model. Lerasle [25] proved the
validity of the slope heuristics in a least-squares density estimation setting, under rather mild conditions on 
the considered linear models. The approach developed by Lerasle in this framework allows sharp computations 
and the empirical excess risk is shown by the author to be exactly equal to the true excess risk. Moreover, 
some improvements compared to the proof techniques of Arlot and Massart [6] can be found in
[25], where Lerasle considers comparisons between all pairs of models, allowing in particular a more refined use
of the bias of the models. Lerasle also proves in the least-squares density estimation setting the efficiency of
Arlot's resampling penalties, and generalizes these results to weakly dependent data, see [26]. Arlot and Bach
[5] recently considered the problem of selecting among linear estimators in non-parametric regression. Their
framework includes model selection for linear regression, the choice of a regularization parameter in kernel
ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. In such cases, the
minimal penalty is not necessarily half the optimal one, but the authors propose to estimate the unknown
variance by the minimal penalty and to use it in a plug-in version of Mallows' $C_L$. The latter penalty is proved
to be optimal by establishing a nonasymptotic oracle inequality with constant almost one. 

In this article, we prove the validity of the slope heuristics in a bounded heteroscedastic regression
framework with random design, by considering a "small" collection of finite-dimensional linear models, a setting
that extends the case of histograms already treated by Arlot and Massart [6]. Two main assumptions must be
satisfied. First, we require that the models have a uniform localized orthonormal basis structure in $L_2(P^X)$,
where $P^X$ is the law of the explanatory variable $X$. This kind of analytical property describing the $L_\infty$-structure
of the models has already been used in a model selection framework by Birgé and Massart [13] and
Barron, Birgé and Massart [5] (see also Massart [28]). Considering for example the unit cube of a Euclidean space
and taking $P^X = \mathrm{Leb}$, the Lebesgue measure on it, it is shown in Birgé and Massart [13] that the assumption of
localized orthonormal basis is satisfied for some wavelet expansions and for piecewise polynomials uniformly bounded in
their degrees. It is also known, see Massart [28], that in the case of histograms the property of localized basis in
$L_2(P^X)$ is equivalent to the lower regularity of the considered partition with respect to $P^X$, an assumption
required by Arlot and Massart in [6]. Moreover, we show in [30] that if $P^X$ has a density with respect to
the Lebesgue measure on the unit interval that is uniformly bounded away from zero, then assuming the
lower regularity of the partition defining piecewise polynomials of uniformly bounded degrees ensures that the
assumption of localized basis is satisfied for such a model. The second property that must be satisfied in our
setting is that the least-squares estimators are uniformly consistent over the collection of models and converge
towards the orthogonal projections of the unknown regression function. Again, such a property is shown in
[30] to be satisfied for suitable histograms and more general piecewise polynomial models. This allows us to
recover the results of Arlot and Massart [6] with the same set of assumptions when the noise is uniformly
bounded from above and from below, and to extend them to models of piecewise polynomials uniformly bounded in
their degrees. Taking advantage of the sharp estimates of the empirical and true excess risks for a fixed model
given in [30], our proofs then rely on the same algebra of proofs as those given in Arlot and Massart [6].

The article is organized as follows. We describe in Section 2 the statistical framework, the slope heuristics
and the subsequent data-driven algorithm of calibration of penalties. We state in Section 3 our main results
and derive their proofs in the remainder of the paper. 

2 Statistical framework and the slope heuristics 
2.1 Penalized least-squares model selection 

We assume that we have $n$ independent observations $\xi_i = (X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ with common distribution $P$. The
marginal law of $X_i$ is denoted by $P^X$. We assume that the data satisfy the following relation

$$Y_i = s_*(X_i) + \sigma(X_i)\,\varepsilon_i , \tag{1}$$

where $s_* \in L_2(P^X)$, the $\varepsilon_i$ are i.i.d. random variables with mean $0$ and variance $1$ conditionally on $X_i$, and
$\sigma : \mathcal{X} \to \mathbb{R}$ is a heteroscedastic noise level. A generic random variable of law $P$, independent of the sample
$(\xi_1, \ldots, \xi_n)$, is denoted by $\xi = (X, Y)$.

Hence, $s_*$ is the regression function of $Y$ with respect to $X$, which we want to estimate. We are given a finite
collection of models $\mathcal{M}_n$, with cardinality depending on the number of data $n$. Each model $M \in \mathcal{M}_n$ is
assumed to be a finite-dimensional vector space, and we denote by $D_M$ its linear dimension and by $s_M$ the linear
projection of $s_*$ onto $M$ in $L_2(P^X)$. Furthermore, by setting $K : L_2(P^X) \to L_1(P)$ the least-squares
contrast, defined by

$$K(s) = (x, y) \mapsto (y - s(x))^2 , \quad s \in L_2(P^X) ,$$

the regression function satisfies

$$s_* = \arg\min_{s \in L_2(P^X)} P K(s)$$

and for the linear projections $s_M$ we have

$$s_M = \arg\min_{s \in M} P K(s) .$$

For each model $M \in \mathcal{M}_n$, we consider a least-squares estimator $s_n(M)$, satisfying

$$s_n(M) \in \arg\min_{s \in M} \{ P_n(K(s)) \} = \arg\min_{s \in M} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - s(X_i))^2 \right\} ,$$

where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ is the empirical measure built from the data. We measure the performance of
the least-squares estimators by their excess risk,

$$\ell(s_*, s_n(M)) := P(K s_n(M) - K s_*) = \| s_n(M) - s_* \|_2^2 ,$$

where $\|s\|_2 = \left( \int_{\mathcal{X}} s^2 \, dP^X \right)^{1/2}$ is the quadratic norm in $L_2(P^X)$. Moreover, we have

$$\ell(s_*, s_n(M)) = \ell(s_*, s_M) + \ell(s_M, s_n(M)) ,$$

where the quantity

$$\ell(s_*, s_M) := P(K s_M - K s_*) = \| s_M - s_* \|_2^2$$

is called the bias of the model $M$ and $\ell(s_M, s_n(M)) := P(K s_n(M) - K s_M) \geq 0$ is the excess risk of the
least-squares estimator $s_n(M)$ on $M$. By the Pythagorean identity, we have

$$\ell(s_M, s_n(M)) = \| s_n(M) - s_M \|_2^2 ,$$

and we prove sharp bounds for the latter quantity in [30], based on the expansion of the least-squares contrast
into the sum of a linear part and a quadratic part.

Given the collection of models $\mathcal{M}_n$, an oracle model $M_*$ is defined to be

$$M_* \in \arg\min_{M \in \mathcal{M}_n} \{ \ell(s_*, s_n(M)) \} \tag{2}$$

and the associated oracle estimator $s_n(M_*)$ thus achieves the best performance in terms of excess risk among
the collection $\{ s_n(M) ; M \in \mathcal{M}_n \}$. Unfortunately, the oracle model is unknown as it depends on the unknown
law $P$ of the data, and we propose to estimate it by a model selection procedure via penalization. Given some
known penalty $\mathrm{pen}$, that is a function from $\mathcal{M}_n$ to $\mathbb{R}_+$, we thus consider the following data-dependent model,
also called selected model,

$$\hat{M} \in \arg\min_{M \in \mathcal{M}_n} \{ P_n(K s_n(M)) + \mathrm{pen}(M) \} . \tag{3}$$

Our goal is then to find a good penalty, such that the selected model $\hat{M}$ satisfies an oracle inequality of the
form

$$\ell(s_*, s_n(\hat{M})) \leq C \times \ell(s_*, s_n(M_*)) ,$$

with some positive constant $C$ as close to one as possible and with high probability, typically more than
$1 - L n^{-2}$ for some positive constant $L$.
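
As an illustration of the selection rule (3), here is a minimal sketch that is not taken from the paper. It runs penalized least-squares over regular histogram models on [0, 1]. The regression function, the noise level, the bounded noise distribution, the penalty constant 0.25 and the convention of setting empty bins to 0 are all assumptions made for the example.

```python
import numpy as np

def fit_histogram(x, y, d):
    """Least-squares fit on the regular histogram model with d bins on [0, 1]."""
    bins = np.minimum((x * d).astype(int), d - 1)
    counts = np.bincount(bins, minlength=d)
    sums = np.bincount(bins, weights=y, minlength=d)
    # Empty bins get the value 0 (a convention of this sketch only).
    means = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
    return means, bins

def empirical_risk(x, y, d):
    """P_n(K s_n(M)): mean squared residual of the histogram fit with d bins."""
    means, bins = fit_histogram(x, y, d)
    return np.mean((y - means[bins]) ** 2)

def select_model(x, y, dims, pen):
    """Dimension minimizing the penalized criterion P_n(K s_n(M)) + pen(M)."""
    crit = [empirical_risk(x, y, d) + pen(d) for d in dims]
    return dims[int(np.argmin(crit))]

# Hypothetical data: uniform design, heteroscedastic bounded noise.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)
eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # mean 0, variance 1
y = np.sin(2 * np.pi * x) + (0.2 + 0.5 * x) * eps

dims = list(range(1, 101))
# Penalty proportional to D_M / n; the constant is an arbitrary choice here.
d_hat = select_model(x, y, dims, pen=lambda d: 2 * 0.25 * d / n)
```

With a zero penalty and nested models the rule picks the largest dimension, which is the overfitting behaviour that the penalty is meant to correct.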

2.2 The slope heuristics 

Let us rewrite the definition of the oracle model $M_*$ given in (2). As for any $M \in \mathcal{M}_n$, the excess risk
$\ell(s_*, s_n(M)) = P(K s_n(M)) - P(K s_*)$ is the difference between the risk of the estimator $s_n(M)$ and the risk
of the target $s_*$, and as $P(K s_*)$ is a constant of the problem, it holds

$$M_* \in \arg\min_{M \in \mathcal{M}_n} \{ P(K s_n(M)) \} = \arg\min_{M \in \mathcal{M}_n} \{ P_n(K s_n(M)) + \mathrm{pen}_{\mathrm{id}}(M) \} ,$$

where for all $M \in \mathcal{M}_n$,

$$\mathrm{pen}_{\mathrm{id}}(M) := P(K s_n(M)) - P_n(K s_n(M)) .$$

The penalty function $\mathrm{pen}_{\mathrm{id}}$ is called the ideal penalty, as it allows one to select the oracle, but it is unknown
because it depends on the distribution of the data. As pointed out by Arlot and Massart [6], the leading idea
of penalization in the efficiency problem is thus to give some sharp estimate of the ideal penalty, in order to
perform an unbiased or asymptotically unbiased estimation of the risk over the collection of models, leading
to a sharp oracle inequality for the selected model. A penalty term $\mathrm{pen}_{\mathrm{opt}}$ is said to be optimal if it achieves
an oracle inequality with constant almost one, tending to one when the number $n$ of data tends to infinity.
Concerning the estimation of the optimal penalty, Arlot and Massart [6] conjecture that the mean of the
empirical excess risk $\mathbb{E}[P_n(K s_M - K s_n(M))]$ satisfies the following slope heuristics in a quite general framework:

(i) If a penalty $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$ is such that, for all models $M \in \mathcal{M}_n$,

$$\mathrm{pen}(M) \leq (1 - \delta)\, \mathbb{E}[P_n(K s_M - K s_n(M))]$$

with $\delta > 0$, then the dimension of the selected model $\hat{M}$ is "very large" and the excess risk of the selected
estimator $s_n(\hat{M})$ is "much larger" than the excess risk of the oracle.

(ii) If $\mathrm{pen} \approx (1 + \delta)\, \mathbb{E}[P_n(K s_M - K s_n(M))]$ with $\delta > 0$, then the corresponding model selection procedure
satisfies an oracle inequality with a leading constant $C(\delta) < +\infty$ and the dimension of the selected
model is "not too large". Moreover,

$$\mathrm{pen}_{\mathrm{opt}} \approx 2\, \mathbb{E}[P_n(K s_M - K s_n(M))]$$

is an optimal penalty.

The mean of the empirical excess risk on $M$, when $M$ varies in $\mathcal{M}_n$, is thus conjectured to be the maximal value
of penalty under which the model selection procedure totally misbehaves. It is called the minimal penalty,
denoted by $\mathrm{pen}_{\min}$:

$$\text{for all } M \in \mathcal{M}_n , \quad \mathrm{pen}_{\min}(M) = \mathbb{E}[P_n(K s_M - K s_n(M))] .$$

The optimal penalty is then close to two times the minimal one,

$$\mathrm{pen}_{\mathrm{opt}} \approx 2\, \mathrm{pen}_{\min} .$$

Let us now briefly explain why points (i) and (ii) above are natural. We give in Section 3 precise results which
validate the slope heuristics for models such as histograms or piecewise polynomials uniformly bounded in
their degrees. If the penalty is the minimal one, then for all $M \in \mathcal{M}_n$,

$$\begin{aligned}
P_n(K s_n(M)) + \mathrm{pen}_{\min}(M)
&= P_n(K s_n(M)) + \mathbb{E}[P_n(K s_M - K s_n(M))] \\
&= P(K s_M) + (P_n - P)(K s_M) + \left( \mathbb{E}[P_n(K s_M - K s_n(M))] - P_n(K s_M - K s_n(M)) \right) \\
&\approx P(K s_M) .
\end{aligned}$$

In the above lines, we neglect $(P_n - P)(K s_M)$ as it is a centered quantity. If moreover the empirical excess risk
$P_n(K s_M - K s_n(M))$ is close enough to its expectation, then the selected model almost minimizes its bias,
and so its dimension is among the largest of the models and the excess risk of the selected estimator blows
up. As shown by Boucheron and Massart [16], the empirical excess risk satisfies a concentration inequality in
a general framework, which allows one to neglect the difference with its mean, at least for models that are not too
small.

Now, if the chosen penalty is less than the minimal one, $\mathrm{pen} \approx (1 - \delta)\, \mathrm{pen}_{\min}$ with $\delta \in (0, 1)$, the algorithm
minimizes over $\mathcal{M}_n$,

$$\begin{aligned}
P_n(K s_n(M)) + \mathrm{pen}(M)
&\approx P(K s_M) - \delta P_n(K s_M - K s_n(M)) + (P_n - P)(K s_M) \\
&\quad + (1 - \delta) \left( \mathbb{E}[P_n(K s_M - K s_n(M))] - P_n(K s_M - K s_n(M)) \right) \\
&\approx P(K s_M) - \delta P_n(K s_M - K s_n(M)) ,
\end{aligned}$$

where in the last line we neglect the deviations of the empirical excess risk and the difference between
the empirical and true risk of the projections $s_M$. As the empirical excess risk is increasing and the risk
of the projection is decreasing with respect to the complexity of the models, the penalized criterion is
decreasing with respect to the complexity of the models, and the selected model is again among the largest of
the collection.

If on the contrary the chosen penalty is more than the minimal one, $\mathrm{pen} \approx (1 + \delta)\, \mathrm{pen}_{\min}$ with $\delta > 0$, then
the selected model minimizes the following criterion over $M \in \mathcal{M}_n$,

$$\begin{aligned}
P_n(K s_n(M)) + \mathrm{pen}(M) - P_n(K s_*)
&\approx \ell(s_*, s_M) + \delta P_n(K s_M - K s_n(M)) + (P_n - P)(K s_M - K s_*) \\
&\quad + (1 + \delta) \left( \mathbb{E}[P_n(K s_M - K s_n(M))] - P_n(K s_M - K s_n(M)) \right) \\
&\approx \ell(s_*, s_M) + \delta P_n(K s_M - K s_n(M)) .
\end{aligned} \tag{4}$$

So the selected model achieves a trade-off between the bias of the models, which decreases with the complexity,
and the empirical excess risk, which increases with the complexity of the models. The selected dimension will
then be reasonable, and the trade-off between the bias and the complexity of the models is likely to give some
oracle inequality.

Finally, if we take $\delta = 1$ in the above case, that is $\mathrm{pen} \approx 2 \times \mathrm{pen}_{\min}$, and if we assume that the empirical excess
risk is equivalent to the excess risk,

$$P_n(K s_M - K s_n(M)) \approx P(K s_n(M) - K s_M) , \tag{5}$$

then according to (4) the selected model almost minimizes

$$P(K s_M - K s_*) + P_n(K s_M - K s_n(M)) \approx \ell(s_*, s_M) + P(K s_n(M) - K s_M) = \ell(s_*, s_n(M)) .$$

Hence,

$$\ell(s_*, s_n(\hat{M})) \approx \ell(s_*, s_n(M_*))$$

and the procedure is nearly optimal. We give in [30] some results showing that (5) is a quite general fact in
least-squares regression.
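
The equivalence (5) can be probed numerically. The sketch below is an illustration under assumptions of ours (uniform design on [0, 1], a sine regression function, bounded uniform noise, a regular histogram model); it is not the paper's experiment. It averages the true excess risk $P(K s_n(M) - K s_M)$ and the empirical excess risk $P_n(K s_M - K s_n(M))$ over independent samples.

```python
import numpy as np

def s_star(x):
    # Hypothetical regression function used only for this illustration.
    return np.sin(2 * np.pi * x)

def projection(d):
    """Bin averages of s_star on the regular d-bin partition (uniform design)."""
    edges = np.arange(d + 1) / d
    return d * (np.cos(2 * np.pi * edges[:-1]) - np.cos(2 * np.pi * edges[1:])) / (2 * np.pi)

def excess_risks(x, y, d):
    """Return (true excess risk, empirical excess risk) of the histogram fit."""
    bins = np.minimum((x * d).astype(int), d - 1)
    counts = np.bincount(bins, minlength=d)
    sums = np.bincount(bins, weights=y, minlength=d)
    s_m = projection(d)
    # Empty bins are set to s_M so that they contribute nothing (sketch convention).
    s_hat = np.where(counts > 0, sums / np.maximum(counts, 1), s_m)
    true_excess = np.mean((s_hat - s_m) ** 2)  # ||s_n(M) - s_M||_2^2 under uniform P^X
    emp_excess = np.mean((y - s_m[bins]) ** 2) - np.mean((y - s_hat[bins]) ** 2)
    return true_excess, emp_excess

def simulate(n=1000, d=20, reps=200, seed=0):
    """Monte Carlo means of the two excess risks over `reps` samples."""
    rng = np.random.default_rng(seed)
    out = np.empty((reps, 2))
    for r in range(reps):
        x = rng.uniform(size=n)
        eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # mean 0, variance 1, bounded
        y = s_star(x) + (0.2 + 0.5 * x) * eps
        out[r] = excess_risks(x, y, d)
    return out.mean(axis=0)

mean_true, mean_emp = simulate()
ratio = mean_emp / mean_true
```

In this setting one expects `ratio` to be close to one, in line with (5); the exact value depends on the simulated configuration.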



2.3 A data-driven calibration of penalty algorithm 

The slope heuristics stated in points (i) and (ii) in Section 2.2 imply that a jump in the dimensions of the
selected models should occur around the minimal penalty, which can be used to estimate the minimal penalty
and consequently the optimal one. Let us denote by $\mathrm{pen}_{\mathrm{shape}}$ the shape of the minimal penalty which is,
according to the slope heuristics, equal to the shape of the optimal penalty. Thus, for two unknown positive
constants $A_{\min}$ and $A_*$ depending on the unknown distribution of the data, we have

$$\mathrm{pen}_{\min} = A_{\min}\, \mathrm{pen}_{\mathrm{shape}} \quad \text{and} \quad \mathrm{pen}_{\mathrm{opt}} = A_*\, \mathrm{pen}_{\mathrm{shape}} ,$$

where

$$A_* = 2 A_{\min}$$

whenever the optimal penalty is twice the minimal one. We assume now that the shape of the minimal penalty
is known, from some prior knowledge or because it has been estimated from the data, for example by using
Arlot's resampling and V-fold penalties as suggested in Arlot and Massart [6]. Then, Arlot and Massart [6]
propose to calibrate the optimal penalty by the following procedure and by doing so, they extend to general
penalty shapes a previous algorithm proposed by Birgé and Massart [14].

Algorithm of data-driven calibration of penalties:

1. Compute the selected model $\hat{M}(A)$ as a function of $A > 0$,

$$\hat{M}(A) \in \arg\min_{M \in \mathcal{M}_n} \{ P_n K(s_n(M)) + A\, \mathrm{pen}_{\mathrm{shape}}(M) \} .$$

2. Find $\hat{A}_{\min} > 0$ such that the dimension $D_{\hat{M}(A)}$ is "very large" for $A < \hat{A}_{\min}$ and "reasonably small" for
$A > \hat{A}_{\min}$.

3. Select the model $\hat{M} = \hat{M}(2 \hat{A}_{\min})$.
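
The three steps of the calibration algorithm can be sketched as follows. This is one possible reading, not the implementation of Arlot and Massart: the jump is located as the largest drop of the selected dimension along a grid of values of $A$, which is an assumption of this sketch, as are the function names. The empirical risks are synthetic and follow the heuristic form $P(K s_M) - \mathrm{pen}_{\min}(M)$ with a bias of order $D_M^{-2}$ and a hypothetical true constant 0.25.

```python
import numpy as np

def selected_dims(emp_risk, shape, dims, grid):
    """Step 1: dimension of the selected model for each constant A in grid."""
    crit = emp_risk[None, :] + grid[:, None] * shape[None, :]
    return dims[np.argmin(crit, axis=1)]

def dimension_jump(emp_risk, shape, dims, grid):
    """Steps 2 and 3: locate the largest drop in dimension, then use twice that constant."""
    d_sel = selected_dims(emp_risk, shape, dims, grid)
    drops = d_sel[:-1] - d_sel[1:]
    # First grid value after the largest drop, taken as the estimate of A_min.
    a_min_hat = grid[np.argmax(drops) + 1]
    d_final = selected_dims(emp_risk, shape, dims, np.array([2.0 * a_min_hat]))[0]
    return a_min_hat, d_final

# Synthetic example (all values hypothetical).
n = 1000
dims = np.arange(1, 201)
shape = dims / n                      # penalty shape proportional to D_M / n
a_min_true = 0.25
emp_risk = 0.5 + 1.0 / dims**2 - a_min_true * shape
grid = np.linspace(0.0, 1.0, 101)

a_min_hat, d_final = dimension_jump(emp_risk, shape, dims, grid)
```

On this synthetic input the estimated constant lands on the first grid point above `a_min_true`, so the grid resolution limits the accuracy of the calibration.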




In this paper, since our aim is not to apply the above algorithm in practice, we refer to Arlot and Massart
[6] for a detailed presentation of the algorithm and to Baudry, Maugis and Michel [11] for an overview of
the slope heuristics and further discussions on implementation issues. Data-driven calibration of penalties
algorithms have already been applied successfully in many statistical frameworks such as mixture models [29],
clustering [11], spatial statistics [31], estimation of oil reserves [24] and genomics [32], to name but a few.
These applications tend to support the conjecture of Arlot and Massart [6] that the slope heuristics is valid in
a quite general framework.



3 Main Results

We state here our results that theoretically validate the slope heuristics in our bounded heteroscedastic
regression setting. In particular, we recover the results stated in Theorems 2 and 3 of Arlot and Massart [6] for
histogram models and extend them to models of piecewise polynomials uniformly bounded in their degrees.
The proofs are postponed to the end of the paper, and heavily rely on results obtained in [30], where we consider
a fixed model, and on the general algebra of proofs developed by Arlot and Massart [6]. We now state the
assumptions required to derive our results.

3.1 Main assumptions

Let us begin with the set of assumptions needed in the general case of models that are provided with a localized
basis in $L_2(P^X)$.

General set of assumptions: (GSA)

(P1) Polynomial complexity of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \leq c_{\mathcal{M}}\, n^{\alpha_{\mathcal{M}}}$.

(P2) Upper bound on dimensions of models in $\mathcal{M}_n$: there exists a positive constant $A_{\mathcal{M},+}$ such that for every
$M \in \mathcal{M}_n$, $1 \leq D_M \leq A_{\mathcal{M},+}\, n (\ln n)^{-2} \leq n$.

(P3) Richness of $\mathcal{M}_n$: there exist $M_0, M_1 \in \mathcal{M}_n$ such that $D_{M_0} \in [\sqrt{n}, c_{\mathrm{rich}} \sqrt{n}]$ and $D_{M_1} \geq A_{\mathrm{rich}}\, n (\ln n)^{-2}$.

(Ab) A positive constant $A$ exists, that bounds the data and the projections $s_M$ of the target $s_*$ over the
models $M$ of the collection $\mathcal{M}_n$: $|Y_i| \leq A < \infty$, $\|s_M\|_\infty \leq A < \infty$ for all $M \in \mathcal{M}_n$.

(An) Uniform lower bound on the noise level: $\sigma(X_i) \geq \sigma_{\min} > 0$ a.s.

(Ap_u) The bias decreases as a power of $D_M$: there exist $\beta_+ > 0$ and $C_+ > 0$ such that

$$\ell(s_*, s_M) \leq C_+ D_M^{-\beta_+} .$$

(Alb) Each model is provided with a localized basis: there exists a constant $r_{\mathcal{M}}$ such that for each $M \in \mathcal{M}_n$
one can find an orthonormal basis $(\varphi_k)_{k=1}^{D_M}$ satisfying that, for all $(\beta_k)_{k=1}^{D_M} \in \mathbb{R}^{D_M}$,

$$\left\| \sum_{k=1}^{D_M} \beta_k \varphi_k \right\|_\infty \leq r_{\mathcal{M}} \sqrt{D_M}\, |\beta|_\infty ,$$

where $|\beta|_\infty = \max\{ |\beta_k| ; k \in \{1, \ldots, D_M\} \}$.

(Ac∞) Consistency in sup-norm of least-squares estimators: an event $\Omega_\infty$ of probability at least $1 - n^{-2-\alpha_{\mathcal{M}}}$,
a positive constant $A_{\mathrm{cons}}$, a positive integer $n_1$ and a collection of positive numbers $(R_{n,D_M})_{M \in \mathcal{M}_n}$ exist,
such that

$$\sup_{M \in \mathcal{M}_n} R_{n,D_M} \leq \frac{A_{\mathrm{cons}}}{\sqrt{\ln n}} \tag{6}$$

and for all $M \in \mathcal{M}_n$, it holds on $\Omega_\infty$, for all $n \geq n_1$,

$$\| s_n(M) - s_M \|_\infty \leq R_{n,D_M} . \tag{7}$$

We turn now to the sets of assumptions needed for histogram models and for piecewise polynomial models,
respectively.

Set of assumptions for histogram models:

Given some linear histogram model $M \in \mathcal{M}_n$, we denote by $\mathcal{P}_M$ the associated partition of $\mathcal{X}$.

Take assumptions (P1), (P2), (P3), (An) and (Ap_u) from the general set of assumptions. Assume moreover
that the following conditions hold true:

(Ab') A positive constant $A$ exists, that bounds the data: $|Y_i| \leq A < \infty$.

(Alrh) Lower regularity of the partitions: there exists a positive constant $c^{h}_{\mathcal{M},P}$ such that,

$$\text{for all } M \in \mathcal{M}_n , \quad \sqrt{ |\mathcal{P}_M| \inf_{I \in \mathcal{P}_M} P^X(I) } \geq c^{h}_{\mathcal{M},P} > 0 .$$
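
The lower-regularity condition (Alrh) compares the smallest cell probability of a partition with the inverse of its number of cells. As a small illustration, assuming $\mathcal{X} = [0, 1]$ and a design law given by its distribution function (the function name and the default uniform law are choices of this sketch), the quantity $|\mathcal{P}_M| \inf_{I} P^X(I)$ can be computed as follows; the assumption asks that it stays bounded away from zero uniformly over the collection.

```python
import numpy as np

def lower_regularity(edges, cdf=lambda t: t):
    """|P_M| * min_I P^X(I) for the partition of [0, 1] defined by its bin edges.

    `cdf` is the distribution function of the design; the default is the uniform law.
    """
    probs = np.diff(cdf(np.asarray(edges, dtype=float)))
    return len(probs) * probs.min()

# A regular partition under the uniform law is perfectly balanced.
regular = lower_regularity(np.linspace(0.0, 1.0, 11))
# A partition with one very thin cell has a much smaller constant.
irregular = lower_regularity([0.0, 0.01, 1.0])
```

A collection containing partitions with arbitrarily thin cells relative to their cardinality would drive this quantity to zero and violate the condition.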



Set of assumptions for piecewise polynomial models:

In this case we take $\mathcal{X} = [0, 1]$, $\mathrm{Leb}$ is the Lebesgue measure on $\mathcal{X}$, and given a linear model $M \in \mathcal{M}_n$ of
piecewise polynomials, we denote by $\mathcal{P}_M$ the associated partition of $\mathcal{X}$.

Take assumptions (P1), (P2), (P3), (An) and (Ap_u) from the general set of assumptions. Assume moreover
that the following additional conditions hold.

(Ab') A positive constant $A$ exists, that bounds the data: $|Y_i| \leq A < \infty$.

(Aud) Uniformly bounded degrees: there exists $r \in \mathbb{N}^*$ such that, for all $M \in \mathcal{M}_n$, all $I \in \mathcal{P}_M$ and all $p \in M$,

$$\deg(p_{|I}) \leq r .$$

(AdLeb) Density bounded from above and from below: $P^X$ has a density $f$ with respect to $\mathrm{Leb}$ satisfying, for
some constants $c_{\min}$ and $c_{\max}$,

$$0 < c_{\min} \leq f(x) \leq c_{\max} < \infty , \quad \forall x \in [0, 1] .$$

(Alrpp) Lower regularity of the partition: a positive constant $c^{pp}_{\mathcal{M},\mathrm{Leb}}$ exists such that, for all $M \in \mathcal{M}_n$,

$$0 < c^{pp}_{\mathcal{M},\mathrm{Leb}} \leq \sqrt{ |\mathcal{P}_M| \inf_{I \in \mathcal{P}_M} \mathrm{Leb}(I) } < +\infty .$$

The sets of assumptions will be discussed in Section 3.3.

3.2 Statement of the theorems

Theorem 1 Under the general set of assumptions (GSA) of Section 3.1, for $A_{\mathrm{pen}} \in [0, 1)$ and $A_p > 0$, we
assume that with probability at least $1 - A_p n^{-2}$ we have

$$0 \leq \mathrm{pen}(M_1) \leq A_{\mathrm{pen}}\, \mathbb{E}[P_n(K s_{M_1} - K s_n(M_1))] , \tag{8}$$

where the model $M_1$ is defined in assumption (P3) of (GSA). Then there exist two positive constants $A_1, A_2$
independent of $n$ such that, with probability at least $1 - A_1 n^{-2}$, we have, for all $n \geq n_0((GSA), A_{\mathrm{pen}})$,

$$D_{\hat{M}} \geq A_2\, n \ln(n)^{-2}$$

and

$$\ell(s_*, s_n(\hat{M})) \geq \ln(n) \inf_{M \in \mathcal{M}_n} \{ \ell(s_*, s_n(M)) \} . \tag{9}$$

Moreover, in the case of histogram and piecewise polynomial models, taking their respective sets of assumptions
defined in Section 3.1 yields the same results.

Thus, Theorem 1 justifies the first part (i) of the slope heuristics exposed in Section 2.2. As a matter of fact,
it shows that there exists a level such that if the penalty is smaller than this level for one of the largest models,
then the dimension of the output is among the largest dimensions of the collection and the excess risk of the
selected estimator is much bigger than the excess risk of the oracle. Moreover, this level is given by the mean
of the empirical excess risk of the least-squares estimator on each model.
The following theorem validates the second part of the slope heuristics. 

Theorem 2 Assume that the general set of assumptions (GSA) of Section 3.1 holds.

Moreover, for some $\delta \in [0, 1)$ and $A_p, A_r > 0$, assume that an event of probability at least $1 - A_p n^{-2}$ exists on
which, for every model $M \in \mathcal{M}_n$ such that $D_M \geq A_{\mathcal{M},+} (\ln n)^3$, it holds

$$(2 - \delta)\, \mathbb{E}[P_n(K s_M - K s_n(M))] \leq \mathrm{pen}(M) \leq (2 + \delta)\, \mathbb{E}[P_n(K s_M - K s_n(M))] \tag{10}$$

together with

$$\mathrm{pen}(M) \leq A_r \frac{(\ln n)^3}{n} \tag{11}$$

for every model $M \in \mathcal{M}_n$ such that $D_M \leq A_{\mathcal{M},+} (\ln n)^3$. Then, for $\frac{1}{2} > \eta > (1 - \beta_+)_+ / 2$, there exist a
positive constant $A_3$ only depending on $c_{\mathcal{M}}$ given in (GSA) and on $A_p$, a positive constant $A_4$ only depending
on constants in the set of assumptions (GSA), a positive constant $A_5$ only depending on constants in the set
of assumptions (GSA) and on $A_r$, and a sequence

$$\theta_n = A_4 \sup_{M \in \mathcal{M}_n} \left\{ \varepsilon_n(M) ,\ A_{\mathcal{M},+} (\ln n)^3 \leq D_M \leq n^{\eta + 1/2} \right\} \leq \frac{A_4 \left( 1 \vee \sqrt{A_{\mathrm{cons}}} \right)}{(\ln n)^{1/4}} \tag{12}$$

such that with probability at least $1 - A_3 n^{-2}$, it holds for all $n \geq n_0((GSA), \eta, \delta)$,

$$D_{\hat{M}} \leq n^{\eta + 1/2}$$

and

$$\ell(s_*, s_n(\hat{M})) \leq \left( \frac{1 + \delta}{1 - \delta} + \frac{5 \left( (\ln n)^{-2} + \theta_n \right)}{(1 - \delta)^2} \right) \ell(s_*, s_n(M_*)) + A_5 \frac{(\ln n)^3}{n} . \tag{13}$$

Assume that in addition, the following assumption holds,

(Ap) The bias decreases like a power of $D_M$: there exist $\beta_- \geq \beta_+ > 0$ and $C_+, C_- > 0$ such that

$$C_- D_M^{-\beta_-} \leq \ell(s_*, s_M) \leq C_+ D_M^{-\beta_+} .$$

Then it holds with probability at least $1 - A_3 n^{-2}$, for all $n \geq n_0((GSA), C_-, \eta, \delta)$,

$$A_{\mathcal{M},+} (\ln n)^3 \leq D_{\hat{M}} \leq n^{\eta + 1/2}$$

and

$$\ell(s_*, s_n(\hat{M})) \leq \left( \frac{1 + \delta}{1 - \delta} + \frac{5 \theta_n}{(1 - \delta)^2} \right) \ell(s_*, s_n(M_*)) . \tag{14}$$

Likewise, in the case of models of histograms and piecewise polynomials, taking their respective sets of
assumptions defined in Section 3.1 together with assumption and, for the second part of the theorem, assumption
(Ap), yields the same results.

The quantity $\varepsilon_n(M)$ used in (12) controls the deviations of the true and empirical excess risks on the model
$M$ and is more precisely defined in Remark 3. From Theorems 1 and 2 we identify the minimal penalty
with the mean of the empirical excess risk on each model,

$$\mathrm{pen}_{\min}(M) = \mathbb{E}[P_n(K s_M - K s_n(M))] .$$

Moreover, Theorem 2 states in particular that if the penalty is close to two times the minimal penalty,
then the selected estimator satisfies a pathwise oracle inequality with constant almost one, and so the model
selection procedure is approximately optimal.

3.3 Comments on the sets of assumptions 

Let us now explain the sets of assumptions given in Section 3.1. Assumption (P1) states that the collection
of models has a small complexity, more precisely a polynomially increasing one with respect to the amount of
data. For this kind of complexity, if one wants to perform a good model selection procedure for prediction,
the chosen penalty should estimate the mean of the ideal one on each model. Indeed, as Talagrand's type
inequalities for the empirical process are pre-Gaussian, they allow one to neglect the deviations of the quantities of
interest from their mean, uniformly over the collection of models. This is not the case for too large collections
of models, where one has to put an extra-log factor depending on the complexity of the collection of models
inside the penalty (see for example [13] and [5]). In assumption (P2) we restrict the dimensions of the models
from above, in a way that is not too restrictive since we allow the dimension to be of the order of the amount
of data within a power of a logarithmic factor. We assume in (P3) that the collection of models contains
a model $M_0$ of reasonably large dimension and a model $M_1$ of high dimension, which is necessary since we
prove the existence of a jump between high and reasonably large dimensions. We demand in (Ap_u) that
the quality of approximation of the collection of models is good enough in terms of bias. More precisely, we
require a polynomial decrease of the excess risk of the linear projections of the regression function onto the models.
Assumptions (Ab), (An), (Alb) and (Ac∞) essentially allow us to apply results of [30], as further explained
in Remark 3 below. Assumption (Ab) is also necessary to control in the proofs the empirical bias term
centered by the true bias by using Bernstein's inequality (see Lemma 5).

In the histogram case, assumption (Ab') implies assumption (Ab), see Section 4 of [30]. Moreover, assumption
(Alrh) allows us in this case to deduce assumptions (Alb) and (Ac$_\infty$) of the general set of assumptions (see
Lemmas 5 and 6 of [30]). Using Lemma 6, it is also straightforward to see that in the histogram case we
have



where $A_{cons}$ is a uniform positive constant over the models of $\mathcal{M}_n$. We obtain in the case of histograms the
same set of assumptions as given in Arlot and Massart [6]. Arlot and Massart [6] also notice that they can
weaken assumptions (Ab') and (An), for example by assuming conditions on the moments of the noise instead
of considering that this quantity is bounded in sup-norm. This latter improvement seems to be beyond the
reach of our method, due to the use of Talagrand-type inequalities that require conditions in sup-norm. Arlot
and Massart [6] also show that the condition (Ap$_u$) is satisfied when $\mathcal{X} \subset \mathbb{R}^k$ and the regression function is
$\alpha$-Hölderian. Moreover, they show that (Ap) is satisfied when, in addition, $s_*$ is non-constant with respect to
the sup-norm.



pen^i, (M) = E [P„ {Ksm - Ks^ [M)) 
As in the case of histogram models, assumption (Ab') implies in the piecewise polynomial case assumption
(Ab), see Section 5 of [30]. Assumptions (Aud), (Ad$_{Leb}$) and (Arpp) allow us to guarantee the statements
(Alb) and (Ac$_\infty$) of the general set of assumptions in this case (see Lemmas 8 and 9 of [30]). Moreover, we
still have



Ru.Dm ^ 



Dm Inn 



within a uniform constant over the models of $\mathcal{M}_n$. It is well known that piecewise polynomials uniformly
bounded in their degrees have good approximation properties in Besov spaces. More precisely, as stated in
Lemma 12 of Barron, Birgé and Massart [8], if $\mathcal{X} = [0, 1]$ and the regression function $s_*$ belongs to the Besov
space $B_{\alpha,p,\infty}(\mathcal{X})$ (see the definition in [5]), then taking models of piecewise polynomials of degree bounded by
$r > \alpha - 1$ on regular partitions with respect to the Lebesgue measure Leb on $\mathcal{X}$, and assuming that the distribution of $X$ has a
density with respect to Leb which is bounded in sup-norm, assumption (Ap$_u$) is satisfied. It remains to find
conditions in this context such that the lower bound on the bias in (Ap) is also satisfied.

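Remark 3 Since the constants in the general set of assumptions (GSA) made above are uniform over the
collection $\mathcal{M}_n$, we deduce from Theorem 3 of [30] applied with $\alpha = 2 + \alpha_M$ and $A_- = A_+ = A_{M,+}$ that if
assumptions (P2), (Ab), (An), (Alb) and (Ac$_\infty$) hold, then a positive constant $A_0$ exists, depending on
$\alpha_M$, $A_{M,+}$ and on the constants $A$, $\sigma_{\min}$ and $r_M$ defined in the general set of assumptions, such that for all
$M \in \mathcal{M}_n$ satisfying

$$0 < A_{M,+} (\ln n)^2 \le D_M ,$$

by setting

$$\varepsilon_n(M) = A_0 \max\left\{ \left(\frac{\ln n}{D_M}\right)^{1/4} ; \left(\frac{D_M \ln n}{n}\right)^{1/4} ; \sqrt{R_{n,D_M,\alpha}} \right\} , \tag{15}$$

we have, for all $n \ge n_0(A_{M,+}, A, A_{cons}, n_1, r_M, \sigma_{\min}, \alpha_M)$,

$$\mathbb{P}\left[ (1 - \varepsilon_n(M)) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 \le P(Ks_n(M) - Ks_M) \le (1 + \varepsilon_n(M)) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 \right] \ge 1 - 10 n^{-\alpha} \tag{16}$$

and

$$\mathbb{P}\left[ (1 - \varepsilon_n^2(M)) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 \le P_n(Ks_M - Ks_n(M)) \le (1 + \varepsilon_n^2(M)) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 \right] \ge 1 - 5 n^{-\alpha} . \tag{17}$$

Moreover, for all $M \in \mathcal{M}_n$, we have by Theorem 4 of [30], for a positive constant $A_u$ depending on $A$, $A_{cons}$, $r_M$
and $\alpha_M$ and for all $n \ge n_0(A_{cons}, n_1)$,

$$\mathbb{P}\left( P(Ks_n(M) - Ks_M) \ge A_u \frac{D_M \vee \ln n}{n} \right) \le 3 n^{-\alpha} \tag{18}$$

and

$$\mathbb{P}\left( P_n(Ks_M - Ks_n(M)) \ge A_u \frac{D_M \vee \ln n}{n} \right) \le 3 n^{-\alpha} . \tag{19}$$

The remainder of this paper is devoted to the proofs.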



4 Proofs 

Before stating the proofs of Theorems 2 and 1, we need two technical lemmas. In the first lemma, we
evaluate the minimal penalty $\mathbb{E}[P_n(Ks_M - Ks_n(M))]$ for models of dimension not too large and not too
small.

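Lemma 4 Assume (P2), (Ab), (An), (Alb) and (Ac$_\infty$) of the general set of assumptions defined in Section 3.1.
Then, for every model $M \in \mathcal{M}_n$ of dimension $D_M$ such that

$$0 < A_{M,+} (\ln n)^2 \le D_M ,$$

we have for all $n \ge n_0(A_{M,+}, A, A_{cons}, n_1, r_M, \sigma_{\min}, \alpha_M)$,

$$\left(1 - L_{A_{M,+},A,\sigma_{\min},r_M,\alpha_M}\, \varepsilon_n^2(M)\right) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 \le \mathbb{E}\left[P_n(Ks_M - Ks_n(M))\right] \tag{20}$$

$$\le \left(1 + L_{A_{M,+},A,\sigma_{\min},r_M,\alpha_M}\, \varepsilon_n^2(M)\right) \frac{1}{4} \frac{D_M}{n} \mathcal{K}_{1,M}^2 , \tag{21}$$

where $\varepsilon_n(M) = A_0 \max\left\{ \left(\frac{\ln n}{D_M}\right)^{1/4} ; \left(\frac{D_M \ln n}{n}\right)^{1/4} ; \sqrt{R_{n,D_M,\alpha}} \right\}$ is defined in Remark 3.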

Proof. As explained in Remark [31 for all n > hq {Am,+, A, Acons,ni,rM,o-niini ctM), we thus have on an 
event fii (M) of probabihty at least 1 — 5n^^^"-^, 

(1 - e„ (M)) /C? M < Pn {KsM ~ Ksn (M)) < (1 + e„ (M)) ]—ICj ^ , (22) 

where e„ (M) = max | (is^) ; (DM^y/^ . ^/r^^ . Moreover, as \Y,\ < A a.s. and ^mIL < ^ 
by (Ab), it holds 

1 " 

< Pn (KsM - KSn (M)) < PnKsM = " (i"» " (^/))' < 4^' (23) 

1=1 

and ds Dm > 1, we have 

e„ (M) = A, max | (^^) ; (^^) ; v/^^Z^j > A^n'^^ . (24) 



We also have 



¥.[Pn{KsM-Ksn {M))] 

= E [Pn [KsM - KSn {M)) lm(M)] + E [Pn [KsM - KSn (Af)) l(J2,(Af))-] • (25) 

Now notice that by (An) we have JCi^m > 2(Tniin > 0. Hence, as Dm > 1, it comes from ((23)) and (|24l) that 

< E [P„ (KsM " i^s„ (M)) l(ni(Af))=] < 20A^n-^-^^ < -j^^el (M) -^/C? ^ ■ (26) 
Moreover, we have e„ (Af ) < 1 for all n > rig {Aq, Am,+, Aeons), so by ([22|) . 

< (1 - 5n-2-"A.) (1 „ ^2 ^ICIm < E [P„ (i^SM - Ks,, (M)) 1o,(m)] (27) 

< (1 - 5n-^-"-) (1 + (M)) ^ICIm ■ (28) 

Finally, noticing that n~'^~°'^ < A^^eliM) by we use ([Ml), dSll) and ([281) in dSS]) to conclude by 

straightforward computations that 

80A^ 2 

^O'^min 

is convenient in (20) and (21), as $A_0$ only depends on $\alpha_M$, $A$, $\sigma_{\min}$ and $r_M$. ■

Lemma 5 Let $\alpha > 0$. Assume that (Ab) of Section 3.1 is satisfied. Then a positive constant $A_d$ exists,
depending only on $A$, $A_{M,+}$, $\sigma_{\min}$ and $\alpha$ such that, by setting $\bar\delta(M) = (P_n - P)(Ks_M - Ks_*)$, we have for
all $M \in \mathcal{M}_n$,

$$\mathbb{P}\left( |\bar\delta(M)| \ge A_d \left( \sqrt{\frac{\ell(s_*, s_M) \ln n}{n}} + \frac{\ln n}{n} \right) \right) \le 2 n^{-\alpha} . \tag{29}$$

If moreover assumptions (P2), (Ab), (An), (Alb) and (Ac$_\infty$) of the general set of assumptions defined in
Section 3.1 hold, then for all $M \in \mathcal{M}_n$ such that $A_{M,+} (\ln n)^2 \le D_M$ and for all
$n \ge n_0(A_{M,+}, A, A_{cons}, n_1, r_M, \sigma_{\min}, \alpha)$, we have

$$\mathbb{P}\left( |\bar\delta(M)| \ge \frac{\ell(s_*, s_M)}{\sqrt{D_M}} + A_d \frac{\ln n}{\sqrt{D_M}} \mathbb{E}[p_2(M)] \right) \le 2 n^{-\alpha} , \tag{30}$$

where $p_2(M) := P_n(Ks_M - Ks_n(M)) \ge 0$.



Proof. We set 



f , ^ 8^2 8A^a WA'^a 
Ad = max <^ AAy/a; -^a; h — 



(31) 



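Since by (Ab) we have $|Y| \le A$ a.s. and $\|s_M\|_\infty \le A$, it holds $\|s_*\|_\infty = \|\mathbb{E}[Y | X]\|_\infty \le A$, and so
$\|s_M - s_*\|_\infty \le 2A$. Next, we apply Bernstein's inequality (96) to $\bar\delta(M) = (P_n - P)(Ks_M - Ks_*)$. Notice that

$$K(s_M)(x, y) - K(s_*)(x, y) = (s_M(x) - s_*(x)) (s_M(x) + s_*(x) - 2y) ,$$

hence $\|Ks_M - Ks_*\|_\infty \le 8A^2$. Moreover, as $\mathbb{E}[Y - s_*(X) | X] = 0$ and $\mathbb{E}[(Y - s_*(X))^2 | X] \le \frac{(2A)^2}{4} = A^2$,
we have

$$\mathbb{E}\left[ (Ks_M(X, Y) - Ks_*(X, Y))^2 \right] = \mathbb{E}\left[ \left( 4 (Y - s_*(X))^2 + (s_M(X) - s_*(X))^2 \right) (s_M(X) - s_*(X))^2 \right]$$

$$\le 8A^2\, \mathbb{E}\left[ (s_M(X) - s_*(X))^2 \right] = 8A^2 \ell(s_*, s_M) ,$$

and therefore, by (96) we have for all $x > 0$,

$$\mathbb{P}\left( |\bar\delta(M)| \ge \sqrt{\frac{16 A^2 \ell(s_*, s_M) x}{n}} + \frac{8A^2 x}{3n} \right) \le 2 \exp(-x) .$$

By taking $x = \alpha \ln n$, we then have

$$\mathbb{P}\left( |\bar\delta(M)| \ge \sqrt{\frac{16 A^2 \alpha\, \ell(s_*, s_M) \ln n}{n}} + \frac{8A^2 \alpha \ln n}{3n} \right) \le 2 n^{-\alpha} , \tag{32}$$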
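The two elementary facts used before applying Bernstein's inequality, namely the factorization of $Ks_M - Ks_*$ and the sup-norm bound $8A^2$, can be checked numerically. The sketch below is only an illustration: the value of `A` and the random draws standing in for $y$, $s_M(x)$ and $s_*(x)$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
A = 1.5
# Arbitrary values with |y| <= A, |s_M(x)| <= A and |s_*(x)| <= A.
y, s_m, s_star = (rng.uniform(-A, A, size=10000) for _ in range(3))

lhs = (y - s_m) ** 2 - (y - s_star) ** 2        # K s_M - K s_* for the least-squares contrast
rhs = (s_m - s_star) * (s_m + s_star - 2 * y)   # factorized form used in the proof

identity_gap = np.max(np.abs(lhs - rhs))        # zero up to rounding errors
sup_norm = np.max(np.abs(lhs))                  # never exceeds 8 * A**2
print(identity_gap, sup_norm, 8 * A ** 2)
```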



which gives the first part of Lemma [S] for Ad given in (PT|) . Now, by noticing the fact that < ar] + bri ^ 

for all 77 > 0, and by using it in ([5^ with a = ^ (s*, sa/), i> = ^nd 77 = -D^/^^ , we obtain 



P \5{M)\ > 



eis.,SM) 



M 



A a Inn 



< 2n^" 



(33) 



Then, for a model M G Ain such that Am.+ (Inrt)^ < Dm, we apply Lemma|4]and by ((20|) . it holds for all 
n>no {AM,+ ,A,Acons,ni,rM,crmin,aM), 



D 



4n 



(34) 
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where £„ = max | (l^) , {2m^)'^\ v/^„,D„,.a|. Moreover as Dm < ^^4,+" (Inn)"' by (P2), 

Rn,DM < Aeons (Inn)^^^^ by ([6]) andylA4^+ (Inn)^ < Dm, we deduce that for aUn > tiq {Am,+ , A, Acons^rMi'^mim olm), 

LAM.-,A,a^i„,rM,aM^n {^'^) ^ 1/2 • 

Now, since /Ci,m > ZcTmin > by (An), we have by (jMl), E [p2 (M)] > ^^-^ for all n > tiq {Am,+ ,A, Aeons, ni,rM,crmin, ^m] 
This allows, using (1551) . to conclude the proof for the value of Ad given in by simple computations. ■ 
In order to avoid cumbersome notation in the proofs of Theorems 2 and 1, when generic constants $L$ and $n_0$
depend on constants defined in the general set of assumptions stated in Section 3.1, we will write $L_{(\mathrm{GSA})}$ and
$n_0((\mathrm{GSA}))$.



Proof of Theorem 2. From the definition of the selected model $\widehat{M}$ given in (3), $\widehat{M}$ minimizes

$$\mathrm{crit}(M) := P_n(Ks_n(M)) + \mathrm{pen}(M) , \tag{35}$$

over the models $M \in \mathcal{M}_n$. Hence, $\widehat{M}$ also minimizes

$$\mathrm{crit}'(M) := \mathrm{crit}(M) - P_n(Ks_*) \tag{36}$$

over the collection $\mathcal{M}_n$. Let us write

$$\ell(s_*, s_n(M)) = P(Ks_n(M) - Ks_*)$$

$$= P_n(Ks_n(M)) + P_n(Ks_M - Ks_n(M)) + (P_n - P)(Ks_* - Ks_M) + P(Ks_n(M) - Ks_M) - P_n(Ks_*) .$$

By setting

$$p_1(M) = P(Ks_n(M) - Ks_M) , \quad p_2(M) = P_n(Ks_M - Ks_n(M)) , \quad \bar\delta(M) = (P_n - P)(Ks_M - Ks_*)$$

and

$$\mathrm{pen}'_{\mathrm{id}}(M) = p_1(M) + p_2(M) - \bar\delta(M) ,$$

we have

$$\ell(s_*, s_n(M)) = P_n(Ks_n(M)) + p_1(M) + p_2(M) - \bar\delta(M) - P_n(Ks_*) \tag{37}$$

and by (36),

$$\mathrm{crit}'(M) = \ell(s_*, s_n(M)) + \left( \mathrm{pen}(M) - \mathrm{pen}'_{\mathrm{id}}(M) \right) . \tag{38}$$

As $\widehat{M}$ minimizes $\mathrm{crit}'$ over $\mathcal{M}_n$, it is therefore sufficient by (38) to control $\mathrm{pen}(M) - \mathrm{pen}'_{\mathrm{id}}(M)$, or equivalently
$\mathrm{crit}'(M)$, in terms of the excess risk $\ell(s_*, s_n(M))$, for every $M \in \mathcal{M}_n$, in order to derive oracle inequalities.
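The decomposition of the excess risk into $p_1$, $p_2$ and $\bar\delta$, and the resulting expression of $\mathrm{crit}'$, are purely algebraic: they hold for any pair of linear functionals playing the roles of $P_n$ and $P$. The following sketch checks this numerically. The two random samples and the random predictions standing in for $s_*$, $s_M$ and $s_n(M)$ are arbitrary assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def contrast(pred, y):
    """Least-squares contrast (y - s(x))^2, evaluated pointwise."""
    return (y - pred) ** 2

# A small sample stands in for P_n and a large one for P.
y_n, y_p = rng.normal(size=50), rng.normal(size=5000)
keys = ("star", "M", "n")  # s_*, s_M and s_n(M)
Pn = {k: contrast(rng.normal(size=50), y_n).mean() for k in keys}
P = {k: contrast(rng.normal(size=5000), y_p).mean() for k in keys}

excess = P["n"] - P["star"]                     # excess risk of s_n(M)
p1 = P["n"] - P["M"]
p2 = Pn["M"] - Pn["n"]
delta_bar = (Pn["M"] - Pn["star"]) - (P["M"] - P["star"])
pen_id = p1 + p2 - delta_bar

# Decomposition of the excess risk (identity (37) in the text).
gap_37 = abs(excess - (Pn["n"] + pen_id - Pn["star"]))

# crit' rewritten as excess risk plus (pen - pen_id), for an arbitrary penalty value.
pen = 0.37
crit_prime = Pn["n"] + pen - Pn["star"]
gap_38 = abs(crit_prime - (excess + pen - pen_id))
print(gap_37, gap_38)
```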
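Let $\Omega_n$ be the event on which:

• For all models $M \in \mathcal{M}_n$ of dimension $D_M$ such that $A_{M,+} (\ln n)^3 \le D_M$, (10) holds and

$$|p_1(M) - \mathbb{E}[p_2(M)]| \le L_{(\mathrm{GSA})}\, \varepsilon_n(M)\, \mathbb{E}[p_2(M)] \tag{39}$$

$$|p_2(M) - \mathbb{E}[p_2(M)]| \le L_{(\mathrm{GSA})}\, \varepsilon_n^2(M)\, \mathbb{E}[p_2(M)] \tag{40}$$

$$|\bar\delta(M)| \le \frac{\ell(s_*, s_M)}{\sqrt{D_M}} + L_{(\mathrm{GSA})} \frac{\ln n}{\sqrt{D_M}} \mathbb{E}[p_2(M)] \tag{41}$$

$$|\bar\delta(M)| \le L_{(\mathrm{GSA})} \left( \sqrt{\frac{\ell(s_*, s_M) \ln n}{n}} + \frac{\ln n}{n} \right) \tag{42}$$

• For all models $M \in \mathcal{M}_n$ of dimension $D_M$ such that $D_M \le A_{M,+} (\ln n)^3$, (11) holds together with

$$|\bar\delta(M)| \le L_{(\mathrm{GSA})} \left( \sqrt{\frac{\ell(s_*, s_M) \ln n}{n}} + \frac{\ln n}{n} \right) \tag{43}$$

$$p_2(M) \le L_{(\mathrm{GSA})} \frac{D_M \vee \ln n}{n} \le L_{(\mathrm{GSA})} \frac{(\ln n)^3}{n} \tag{44}$$

$$p_1(M) \le L_{(\mathrm{GSA})} \frac{D_M \vee \ln n}{n} \le L_{(\mathrm{GSA})} \frac{(\ln n)^3}{n} \tag{45}$$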

By (fT6l) . (fT7|). (fT8| and ([T9l) in Remark [H Lemma IH Lemma [5] applied with a = 2 + awi, and since ([T0| holds 
with probability at least 1 — Apn~^, we get for all n > uq ((GSA)), 



MeM„ 



P-CM 



Control on the criterion crit^ for models of dimension not too small: 

We consider models M e A^„ such that Am,+ (Inn)'^ < Dm- Notice that (HIT) implies by ([T5|) that, for all 
M e Mn such that Am,+ {\nnf < Dm, for ah n>no ((GSA)), 

\~SiM)\ < i(GSA)f^^-^l xE[£{s.,SM)+P2iM)] 
< L(GSA)£„(Af)IE[^(s*,.SM)+P2(M)] , 
so that on ri„ we have, for all models Af e 7M„ such that Am.+ (Inn)"^ < Dm, 

Ipen^d (M)-pen(Af)| 

< \pi {M)+p2 (M) - pen ( Af ) I + I ^ (Af ) | 

< \pi (A/) + p2 (A//) - 2E [p2 (A/)] I + [p2 (Af)] + L(GSA)en (M) E [£ (s„ sm) + P2 (M)] 

< i(GSA)£n (M) E [p2 (A/)] + ,5E [p2 (M)] + L(GSA)en (Af ) E [i (s, , sm) + P2 (Af)] 

< + i(GSA)£n (A/)) E (S„ SAf) + P2 (A/)] . (46) 

Now notice that using (P2) and ^ in ^T5\i gives that for all models M € A4n such that Am,+ (Inn)"^ < Dm 
and for all n > uq ((GSA)), < i(GSA)£ri (A/) < i. As ^ (s*, Sn (M)) — I {s^.,sm) +Pi (A/), we thus have on 
ilr,, for all n > no ((GSA)), 

0<E[^(s,,sm)+P2 (A/)] 
<£{s,, s„ (Af )) + |pi (Af) - E [p2 (M)] I 

<^(..,.„(Af))+ /7^^)""^^^'] pi(M) by® 

J- - -t^ (GSA) En l^W j 
-L ~ ^(GSA)^™ (A'i j 

< (1 + i(GSA)en (Af )) ^ (S* , Sn (A/)) . (47) 

Hence, using (ITf|) in PS|) . we have on i7„ for all models M G such that Am,+ (Inn)'^ < Dm and for all 
n>no((GSA)), 

IpenJd (A/) - pen (Af )| < {5 + L(gsa)£„ (Af )) ^ (s*, s„ (A/)) . (48) 

By consequence, for all models Af e Mn such that Am,+ (Inn)'^ < Dm and for all n > no ((GSA)), it holds 
on ri„, using ([25]) and 

(1 - (5 - i(GSA)e« (M)) £ (s* , sn (Af)) < crit' (Af ) < (l + (5 + i(GSA)e« (Af )) ^ (s* , s„ (Af )) . (49) 
Control on the criterion crit' for models of small dimension:



We consider models M e Mn such that Dm < Am,+ (Inn)^. By dTT]), (|33]) and it holds on VL^, for any 
r > and for aU M G Mn such that Dm < Am.+ (Inn)^, 

IpenJd (M)-pen(M)| 

< pi (M) + p2 (M) + pen (Af ) + 15 (M) I 



(Inn)^ (Inn)^ / (s*, SAf ) In n lnn\ 

S -t^(GSA) 1- A- ^^-^^(GSA) V \ 

n n \ V n n I 

(Inn)^ \ / -1 \ In'T- 

^ ' n n 

(Inn)^ , . ^^ / 1 \ Inn , , 

<i(GSA),A.^^+T^(s*,S„(Af)) + (T-l + l)L(GSA)— • (50) 

Hence, by taking r = (lnn)~^ in ([SO)) we get that for all Af e such that Z?a/ < Am.+ (Inn)"^, it holds on 

|pen;,(Af)-pen(A//)| < ilfflfliij^ll + L(gsa),a. • (51) 

(In n) n 

Moreover, by (155]) and ([?T|) . we have on the event for all A/ e A^„ such that D^/ < ^a^,+ (Inn)^, 

(l - (lnn)-2) I (s„ s„ (Af)) - L(gsa),a.^^^ < crit' (Af) (52) 
(l + (lnn)-')£(s*,s„(Af)) + L(GSA),A.-^^^ ■ (53) 

Oracle inequalities:

Recall that by the definition given in (2), an oracle model satisfies

$$M_* \in \arg\min_{M \in \mathcal{M}_n} \left\{ \ell(s_*, s_n(M)) \right\} . \tag{54}$$

By Lemmas 6 and 7 below, we control on $\Omega_n$ the dimensions of the selected model $\widehat{M}$ and the oracle model $M_*$.
More precisely, by (66) and (68), we have on $\Omega_n$, for any $\frac{1}{2} > \eta > (1 - \beta_+)_+ / 2$ and for all $n \ge n_0((\mathrm{GSA}), \eta, \delta)$,

$$D_{\widehat{M}} \le n^{1/2 + \eta} , \tag{55}$$

$$D_{M_*} \le n^{1/2 + \eta} . \tag{56}$$


< 



Now, from ([55]) we distinguish two cases in order to control crit' yMj. If Am.+ (Inn) < D-g^ < 71^/^+'', we 
get by (gg), for all n > no ((GSA)), 

crit' (m) > (1 - 5 - L(GSA)£n (m) ) ^ (s, , s„ (m) ) . (57) 
Otherwise, if Dj^ < Am,+ (Inn)'^, we get by ((52|) . 

(1 - (Inn)-') I (s„s„ (m)) - L(GSA),A.-^^^ < crit' (m) . (58) 
In all cases, we have by (57) and (58), for all $n \ge n_0((\mathrm{GSA}))$,

crit' (m) > ( 1 - 5 - (Inn)"' - L(gsa) sup £„ (M) ) £ (s,, s„ (m)) 

^ ^ V J\/eA^„, ylAi,+ (lnn)3<_DM<"i/=+'' / ^ V // 

-L(GSA).A^ ■ 59 

n 

Similarly, from ([56| we distinguish two cases in order to control crit' (M,). II Am,+ (Inn)"^ < < 
we get by (gSI), for all 71 > 710 ((GSA)), 

crit' (A/*) <{l + 6 + L(GSA)en (M*)) ^ (s*, Sn (M*)) - (60) 

Otherwise, if Dm, < Am,+ (Inn)'^, we get by ([SS]) . 

crit'(M,) < (l + (lnn)-2)£(s„s„(M,)) + L(GSA),A.^^^ • (61) 
In all cases, we deduce from (pO)) and (|6ip that we have for all n > uq ((GSA), (5), 

crit' (M,) < I 1 + 5 + (hiTi)"' + i(GSA) sup £„ (M) I £ (s,, s„ (M*)) 

\ MeM„, A^, + (lnn)3<Djf <ni/2+>7 y 

^ ' n 

Hence, by setting 

0n = i(GSA) X sup £„ (Af ) , 

MeA4„, AAi,+ (lnTi)3<_Dj/<Tii/2 + .) 

we have by p5|) and for ah n > uq ((GSA), r/, 5), 

< ^^"^^fl , {\nny^ + en + S<l , (In7i)-V0„ < i-^ 
(Inrt) ' ^ 

and we deduce from ((591) and dM]), since < 1 + 2a; for all x E [O, i), that for aU ti > tiq ((GSA), t?, 5), it 
holds on fin, 



l + S+ilnn) +^n\^f fnrw t I^(GSA) ,Ar (Inn ) 



i (s.,s„ (m)) < + + " £ is., (M,)) + 

^ ^ ^ ^ V 1 - - Inn - / 



3 



< 



' ^ ^ ' ns*,Sn(M*))+i(GSA),A.^^ . (63) 



2 



Inequality (13) is now proved.
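The last step of this argument rests on an elementary inequality: writing $t$ for $(\ln n)^{-2} + \theta_n$, one has $\frac{1+\delta+t}{1-\delta-t} \le \frac{1+\delta}{1-\delta} + \frac{5t}{(1-\delta)^2}$ as soon as $0 \le t < (1-\delta)/2$, which follows from $\frac{1}{1-x} \le 1 + 2x$ on $[0, 1/2)$. The sketch below checks it on a grid; the grid bounds and the helper names `ratio` and `bound` are illustrative choices.

```python
import numpy as np

def ratio(delta, t):
    """Leading constant obtained by dividing the upper bound on crit' by the lower one."""
    return (1 + delta + t) / (1 - delta - t)

def bound(delta, t):
    """Simplified leading constant appearing in the oracle inequality."""
    return (1 + delta) / (1 - delta) + 5 * t / (1 - delta) ** 2

worst = -np.inf
for delta in np.linspace(0.0, 0.95, 96):
    # t stands for (ln n)^{-2} + theta_n, restricted to t < (1 - delta) / 2.
    for t in np.linspace(0.0, 0.499 * (1 - delta), 200):
        worst = max(worst, ratio(delta, t) - bound(delta, t))
print(worst)  # largest value of ratio - bound over the grid
```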

It remains to prove the second part of Theorem 2. We assume that assumption (Ap) holds. From Lemmas 6
and 7 we have that for any $\frac{1}{2} > \eta > (1 - \beta_+)_+ / 2$ and for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta, \delta)$, it holds on $\Omega_n$,

$$A_{M,+} (\ln n)^3 \le D_{\widehat{M}} \le n^{1/2 + \eta} , \tag{64}$$

$$A_{M,+} (\ln n)^3 \le D_{M_*} \le n^{1/2 + \eta} . \tag{65}$$

Now, using (57) and (60), by the same kind of computations leading to (63), we deduce that it holds on $\Omega_n$,
for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta, \delta)$,

$$\ell(s_*, s_n(\widehat{M})) \le \frac{1 + \delta + \theta_n}{1 - \delta - \theta_n}\, \ell(s_*, s_n(M_*)) \le \left( \frac{1+\delta}{1-\delta} + \frac{5\theta_n}{(1-\delta)^2} \right) \ell(s_*, s_n(M_*)) .$$

Thus inequality (14) is proved and Theorem 2 follows. ■

Lemma 6 (Control on the dimension of the selected model) Assume that the general set of assumptions
(GSA) holds. Let $\eta > (1 - \beta_+)_+ / 2$. If $n \ge n_0((\mathrm{GSA}), \eta, \delta)$ then, on the event $\Omega_n$ defined in the proof
of Theorem 2, it holds

$$D_{\widehat{M}} \le n^{1/2 + \eta} . \tag{66}$$

If moreover (Ap) holds, then for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta, \delta)$, we have on the event $\Omega_n$,

$$A_{M,+} (\ln n)^3 \le D_{\widehat{M}} \le n^{1/2 + \eta} . \tag{67}$$

Lemma 7 (Control on the dimension of oracle models) Assume that the general set of assumptions
(GSA) holds. Let $\eta > (1 - \beta_+)_+ / 2$. If $n \ge n_0((\mathrm{GSA}), \eta)$ then, on the event $\Omega_n$ defined in the proof of
Theorem 2, it holds

$$D_{M_*} \le n^{1/2 + \eta} . \tag{68}$$

If moreover (Ap) holds, then for all $n \ge n_0((\mathrm{GSA}), C_-, \beta_-, \eta)$, we have on the event $\Omega_n$,

$$A_{M,+} (\ln n)^3 \le D_{M_*} \le n^{1/2 + \eta} . \tag{69}$$
Proof of Lemma [6] Recall that M minimizes 

crit' (M) = crit (M) - PnKs^ = ^ (s,, sm) - P2 (M) + 5 (M) + pen (M) (70) 
over the models M G Mn- 

1. Lower bound on crit' (M) for small models in the case where (Ap) hold : let M G Ain be such that 
D^i < (Inn) . We then have on fin, 

e {s*,sm) > C^A)^-^ (Inn)-'^- by (Ap) 
pen (M) > 

fill Jl)^ 

P2 (M) < L^GSA}- from dMl) 



S (M) > -L(GSA) \/ + from (gS 



Since by (Ab), we have < i{s^,,SM) < -^A^ , we deduce that for aU n > uq ((GSA), C_, ,9„) , 

C^A~^- 

crit' (M) > ^ (Inn)"^''- . (71) 



2. Lower bound for large models : let M G A4n be such that Dm > ri^/^+''. From (fTO|) and (j40|) we have 

on fin, 

pen (M) - p2 (M) > {l - 6 - L(gsa)4 (M)) E [p^ {M)] . 

Using (P2), del and the fact that Dm > n^/^+'^ in we deduce that for ah n > no ((GSA), 77, 5), 

i(GSA)£n (M) < i (1 — (5) and as by (An), JCi,m > 2cri„in we also deduce from Lemma S] that for all 

n > no ((GSA), 7]), E[p2 (M)] > By consequence, it holds for all n > no ((GSA), r;, (5), 

pen (M) - p2 (M) > ^ (1 - ,5) — . (72) 
4 n 

From (HI) it holds on fl„ , 



^(Af)>-W) (^ f ^^-7^'^" +^] . (73) 

Hence, as Dm > n^/^+'f and as by (Ab), 0<i (s,, sa/) < 4^1^, we deduce from ^ and ^ that 
we have on fin, for all n > no ((GSA), 77, 6), 

crit' (M) > (1 - 5) i(GSA)«-'/'+'' • (74) 
3. A better model exists for crit'(M): from (P3), there exists $M_0 \in \mathcal{M}_n$ such that $\sqrt{n} \le D_{M_0} \le c_{rich} \sqrt{n}$.
Then, for all n > no ((GSA),?]), 

Am.+ {Innf < \/n < Dmo < Crich\fn < n^^^^'' . 

Using (Ap„), 

£{s,,.SMo) <C+n-^+/^ . (75) 
By (|4T|) . we have on fin, for all n> uq ((GSA),?]), 

\HMo)\ < i%^+L(GSA)^^Eb2(Mo)] (76) 

and by pllj) . 

pen (Mo) < 3E [p2 (Mq)] . 

Hence, as /Ci,m < and ^(s*,sa/o) < by (Ab) and as for all n > uq ((GSA)) e„ (M) < 1, we 
deduce from inequalities ([75]) . ([75]) and Lemma H] that for all n>no ((GSA), 77), 



and 

pen (Mo) < L(GSA)f^"^^^ • 
By consequence, we have on Qn, for all n > uq ((GSA), ry), 

crit' (Mo) < £ (s„ SMo) + \~HMo)\ + pcn (Mo) 



(^„-^+/2+„-i/2) . (77) 



To conclude, notice that the upper bound ([77)1 is smaller than the lower bound given in ((74|) for all n > 
no ((GSA), 77, 5). Hence, points 2 and 3 above yield inequality Moreover, the upper bound ([77]) is smaller 
than lower bounds given in ([7T|) . derived by using (Ap), and ([7^ . for all n > uq ((GSA), C_, /3_, 77, (5) . This 
thus gives (j67p and Lemma [S] is proved. H 
Proof of Lemma [T] By definition, A/* minimizes 

^(s,,s„ (M)) =^(s,,sm)+Pi (M) 

over the models M e A4n- 

1. Lower bound on £ (s*, s„ (M)) for small models : let M E A4n be such that Dm < Am,+ (Inrt)'^ . In this 
case we have 

£{s^,,Sn{M))>t{s,,SM)>C-A]^;_^{\nn)-^^- by (Ap). (78) 

2. Lower bound of £ (s*, s„ (M)) for large models : let M e Mn be such that Dm > n^^^^^- From ([5^ we 
get on ri„, 

Pi (M) > (1 - i(GSA)en (Af )) E b2 (Af )] . 

Using (P2), dH) and the fact that Dm > n^/'^+'^ in (HI]), we deduce that for all n > no ((GSA), 77), 
i(GSA)£n (Af) < ^ and as by (An), /Ci_a/ > SfJmin we also deduce from Lemma 0] that for all 77, > 

7io ((GSA), 77), E [p2 (A/)] > By consequence, it holds for all n > no ((GSA), 7/), on the event 

£ (s*, s„ (M)) > pi (M) > ^£e. > ^^-1/2+^ , (79) 
3. A better model exists for $\ell(s_*, s_n(M))$: from (P3), there exists $M_0 \in \mathcal{M}_n$ such that $\sqrt{n} \le D_{M_0} \le c_{rich} \sqrt{n}$.
Moreover, for all $n \ge n_0((\mathrm{GSA}), \eta)$,

Am,+ (Inn)^ <V^< Dmo < c„chV^ < n^/^+'j . 

Using (Ap„), 
and by ^ 

Pi (Mo) < (1 + i(GSA)en (M)) E [p2 (Mo)] 

Hence, as /Ci,a/ < 6A by (Ab) and as, by ^ and ([15]), for ah n > hq ((GSA)) it holds £„ (M) < 1, we 
deduce from Lemma |4] that for all n > uq ((GSA)), on the event ri„. 

Pi (Mo) < ^(GSA)^^ < -^(GSA)""^^^ • 

By consequence, on r2„, for all n > uq ((GSA)), 

e (s,, s„ (Mo)) = e {s,,SMo)+Pi (Mo) 

<i(GSA)(n-'^+/'+r^-l/2) . (80) 

The upper bound ([50]) is smaller than the lower bound (|79p for all n > no ((GSA), t;), and this gives 
((55)) . If (Ap) hold, then the upper bound (|M| is smaller than the lower bounds (175)1 and ([75)1 for aU 
n > no ((GSA), C_ , /3_ , 77) , which proves (15^ and allows to conclude the proof of Lemma [T] ■ 



Proof of Theorem [TJ Similarly to the proof of Theorem [21 we consider the event $7^ of probability at least 
1 — Lcjvi,Ap'^^^ for all n > Uq ((GSA)), on which: holds and 

• For all models M € Ain of dimension Dm such that Am,+ (Inn)^ < Dm it holds 

|pi (M) - E [p2 (M)] I < L(GSA)£« (M) E [p2 (M)] , (81) 
\P2 (M) - E [p2 (M)] I < L(GSA)4 (M) E [p2 (M)] . (82) 

• For all models M G A^„ with Da/ < Am.+ (Inn)^ it holds 

P2 M <i(GSA)^ -■ 83 

n 

• For every M e A^„, 

|5(M)| <L(osA) . (84) 

Let d G (0, 1) to be chosen later. 

Lower bound on IJr^. Remind that M minimizes 

M 

crit' (M) = crit (M) - P„Ks^ = £ (s*, sa/) - P2 [M) + 5 (Af) + pen (Af) . (85) 
1. Lower bound on crit' (Af ) for "small" models : assume that M G M.n and 

Dm < dArichnilnny'^ . 

We have 

^(s*,SA/) + pen(A/) > (86) 
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and from dHU), as i {s^,sm) < 4^1^ by (Ab), we get on for all n > uq ((GSA),d), 



xtA^\ ^ r I . /^(s*,SAf)lnn Inn 

(M) > -L(GSA) 



n n 



Inn 

> -^(GSA)Y — 

>-dxA^A„ch{\nny^ . (87) 

Then, if Dm > Am,+ (Inn)^, as /Ci,m < 6yl by (Ab) and as, by dH) and for all n > no ((GSA)) it 
holds L(GSA)£n i^i) ^ 1; we deduce from ([5^ and Lemma|3]that for all n > ((GSA)), 

P2 (M) < 2E [p2 [M)] < 36 A^— <dx 36A^A„ch (Inn)"^ . 

n 

Whenever Dm < Am,+ (Inn)^, ((83)) gives that, for all n > uq {{GSA),d), on the event il^, 

P2 (M) < i(GSA)^^^^ <dx 36A^A„ch (Inn)"' . 
n 

Hence, we have checked that for all n > tiq ((GSA), d), on the event fi^, 

-P2 (M) >-dx 36 A^A„ch (lnn)~' , (88) 
and finally, by using (US]), (HZl) and (HH) in we deduce that on n'^, for aU n > no ((GSA), d), 

crit' (M) > -d X SrA^Ar^ch (Inn)"' . (89) 

2. There exists a better model for crit' (M) : By (P3), for all n > uq {Am,+ , Arich) a model Mi € A^„ 
exists such that ^ 

(Inn)' < ™\ < Dmi ■ 
(Inn) 



We then have on flL 



i (s„ SA/J < A„2 {\nnf^+ n'^^ by (Ap^ 



P2 (Ml) > (1 - L(GSA)4 (A^i)) E b2 (Afi)] by ([82 
pen (Ml ) < ^penE [p2 {Ml)] by i 



1 5 (Ml) I < L(GSA)\/— by dSl and (Ab) 



and therefore, 



crit' (Afi) < (-1 + Apen + L(GSA)e' (Ml)) E [P2 (Ml)] + i(GSA) V ^ + • (90) 



2/3+ 



Hence, as —1 + Apon < 0, and as by ([S]), (|15l) . (An) and Lemma|3]it holds for all n > uq ((GSA), Apon) 

i(GSA)4 (Ml) < and E [p^ (Ah)] > > (Inn)"' , 

we deduce from (lUU)) that on fij^, for all n > uq ((GSA), Apcn), 

crit' {Ml) < (1 - Apen) <inA„c/i (Inn)"' . (91) 



21 



Now, by taking 

and by comparing ([89]) and (|9T|). we deduce that on ilj^, for all n > no ((GSA), Apcn), for all M E Ain such 
that Dm < dArichn {Inn) ^, 

crit' (Ml) < crit' (M) 

and so 

IJjj > dAjch"- (lnn)~^ . (93) 

Excess Risk of s„ (''^^) ■ ^^'^^ with the value given in (|92l) . First notice that for all n > no {A-M.+, Arich, d) 

we have dArichn (Inn) ^ > Am,+ (Inn)^. Hence, for all M G jM„ such that _Dj\/ > dA^ichn (Inn) ^, by ([6]), 
P^ . (P2), (An) and LemmaH it holds on VL'^ for all n > no ((GSA), Apon), using (|5T|) . 

^(5.,s„ (A/)) > p, (M) > 4^:^ > (inn)-2 . 

2 n 2 

By (|93l) . we thus get that on $1',^, for all n > no ((GSA), Apon), 

^(s„s„(M))>^^^^(lnn)-^ . (94) 
Moreover, the model Mq defined in (P3) satisfies, for all n > no ((GSA)), 

Am,+ {Innf < \/n< Dm„ < CrichV^ 

and so using (Ap^,), 
In addition, by ([M)) . 

Pi [M) < (1 + i(GSA)£n (M)) E b2 (M)] . 

Hence, as /Ci,a/ < 6v4 by (Ab) and as, by ^ and P3)) . for all n > no ((GSA)) it holds (M) < 1, we deduce 
from Lemma m that for all n > no ((GSA)) 

Pi {M) < i(GSA)— ^ < i(GSA)"-^^^^ ■ 

^ ' n ^ ' 

By consequence, for all n > uq ((GSA)), 

£ (s*, s„ (Mo)) < L(GSA) (n-'^+/2 ^ ^^-1/2^ (95) 

and the ratio between the two bounds (|94l) and ([95]) is larger than In (n) for all n > no (-Zj(GSA)7^pcn), which 
yields dH). ■ 

Probabilistic Tools We recall here the main probabilistic results that are instrumental in our proofs. 
The following tool is the well known Bernstein's inequality, that can be found for example in [3B] , Proposition 
2.9. 

Theorem 8 (Bernstein's inequality) Let $(X_1, \ldots, X_n)$ be independent real valued random variables and define

$$S = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mathbb{E}[X_i]) .$$

Assuming that

$$v = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i^2] < \infty$$

and

$$|X_i| \le b \quad \text{a.s.}$$

we have, for every $x > 0$,

$$\mathbb{P}\left[ |S| \ge \sqrt{2v \frac{x}{n}} + \frac{bx}{3n} \right] \le 2 \exp(-x) . \tag{96}$$
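As a sanity check of the form of this bound, the following Monte Carlo sketch compares the empirical frequency of the deviation event with $2\exp(-x)$. The uniform distribution on $[-b, b]$, the sample size and the number of repetitions are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, b = 200, 20000, 1.0
# Uniform variables on [-b, b]: |X_i| <= b, E[X_i] = 0 and E[X_i^2] = b^2 / 3.
samples = rng.uniform(-b, b, size=(reps, n))
s = samples.mean(axis=1)     # centered empirical means, since E[X_i] = 0
v = b ** 2 / 3               # (1/n) * sum of E[X_i^2]

results = []
for x in (1.0, 2.0, 3.0):
    threshold = np.sqrt(2 * v * x / n) + b * x / (3 * n)
    empirical = np.mean(np.abs(s) >= threshold)   # frequency of the deviation event
    results.append((x, empirical, 2 * np.exp(-x)))

for x, empirical, bernstein in results:
    print(f"x={x}: empirical frequency {empirical:.4f}, Bernstein bound {bernstein:.4f}")
```

In this setting the empirical frequencies should sit well below the bound, since Bernstein's inequality is not tight for such light-tailed variables.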



We now turn to concentration inequalities for the empirical process around its mean. Bousquet's inequality 
[It] provides optimal constants for the deviations above the mean. Klein-Rio's inequality [20] gives sharp 
constants for the deviations below the mean, that slightly improves Klein's inequality |21| . 

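Theorem 9 Let $(\xi_1, \ldots, \xi_n)$ be $n$ i.i.d. random variables having common law $P$ and taking values in a
measurable space $\mathcal{Z}$. If $\mathcal{F}$ is a class of measurable functions from $\mathcal{Z}$ to $\mathbb{R}$ satisfying

$$|f(\xi_i) - Pf| \le b \quad \text{a.s., for all } f \in \mathcal{F},\ i \le n,$$

then, by setting

$$\sigma_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}} \left\{ P(f^2) - (Pf)^2 \right\} ,$$

we have, for all $x \ge 0$,

Bousquet's inequality:

$$\mathbb{P}\left[ \|P_n - P\|_{\mathcal{F}} - \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] \ge \sqrt{2 \left( \sigma_{\mathcal{F}}^2 + 2b\, \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] \right) \frac{x}{n}} + \frac{bx}{3n} \right] \le \exp(-x) \tag{97}$$

and we can deduce that, for all $\varepsilon, x > 0$, it holds

$$\mathbb{P}\left[ \|P_n - P\|_{\mathcal{F}} - \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] \ge \sqrt{2 \sigma_{\mathcal{F}}^2 \frac{x}{n}} + \varepsilon\, \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] + \left( \frac{1}{\varepsilon} + \frac{1}{3} \right) \frac{bx}{n} \right] \le \exp(-x) . \tag{98}$$

Klein-Rio's inequality:

$$\mathbb{P}\left[ \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] - \|P_n - P\|_{\mathcal{F}} \ge \sqrt{2 \left( \sigma_{\mathcal{F}}^2 + 2b\, \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] \right) \frac{x}{n}} + \frac{bx}{n} \right] \le \exp(-x) \tag{99}$$

and again, we can deduce that, for all $\varepsilon, x > 0$, it holds

$$\mathbb{P}\left[ \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] - \|P_n - P\|_{\mathcal{F}} \ge \sqrt{2 \sigma_{\mathcal{F}}^2 \frac{x}{n}} + \varepsilon\, \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] + \left( \frac{1}{\varepsilon} + 1 \right) \frac{bx}{n} \right] \le \exp(-x) . \tag{100}$$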